Objectives

library(tidyverse)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(forcats)
library(ggraph)
library(igraph)

options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())

Network theory

A network (also sometimes referred to as a graph) is a set of relationships. Networks contain a set of objects (nodes/verticies) and a mapping or description of relations between the nodes (link/edge). For example, a simple network contains two objects, 1 and 3, and one relationship, or edge, that links them:

The above edge is an undirected edge - there is no directionality to the relationship between 1 and 2. A directed edge is an ordered pair of nodes, with an arrow drawn to indicate the directionality of the edge:

There are lots of ways to analyze and measure networks. Let’s first look at two examples of network analysis applied to real-life data, then circle back and talk about packages for network visualization and analysis in R.

Studying political polarization through networks

The Rise of Partisanship in the U.S. House of Representatives

There is little argument that political polarization is occurring in the United States. Partisanship can be attributed to many causes, including:

  • Income inequality
  • Gerrymandering/redistricting
  • Activists involved in primary elections
  • Political realignment in the South
  • Electing more partisan members to Congress
  • Shift by existing members to ideological poles
  • Increasingly partisan media

Research design

In this paper, the authors studying partisanship in the U.S. House of Representatives by examining relationships between legislators. On one hand, legislators are pressured by party leaders to vote with members of their own party with incentives such as committee assignments and campaign funding used to keep members in check. Alternatively, legislators have individual incentives to cooperate with members of the opposite political party (responsiveness to individual constituencies’ concerns).

Here, the authors measure the extent to which legislators form ideological relationships with members of the opposite party by examining cooperation rates between individual members of Congress on roll-call votes. In this research design, legislators are the nodes and the frequency of agreement on roll call votes are the edges. Edges are calculated for each pair of legislators serving in a two-year term of Congress, resulting in nearly 6,000,000 pairs of legislators. The data is represented in matrix form with the rows and columns identifying each legislator-term observation and the cells identifying the frequency of voting together in that term. Therefore the network/graph is undirected (it doesn’t matter the ordering of the pair, they will still have the same number of votes in agreement with one another).

Visualizing polarization

Pairs are defined as either cross-party (comprised of a single Republican and a single Democrat) or same-party (comprised of two Democrats or two Republicans). We can then think of partisan affiliation as an attribute of each node, and the cross- or same-party pairing being an attribute of each edge.

This figure from the article is a fairly typical visualization of network data known as a node-edge diagram. Nodes are drawn as point marks and the edges connecting them are drawn as line marks. Drawing node-edge diagrams is tricky for the simple reason that you need to determine where to draw each node on a two-dimensional coordinate plane. The problem is that there is no inherent or meaningful value of x-y coordinates. In this data, there is no attribute/variable that tells us where to draw these nodes.

Instead, node-edge diagrams rely on algorithms to determine spatial positioning in the visualization. There are many algorithms that can perform this task, but they typically consider connectivity and distance. Distance is defined as the number of edges along the shortest path connecting two nodes. Nodes that are tightly connected by shorter distances would therefore be grouped closer together in the visualization. Nodes that are loosely connected by many edges or links should be farther apart on the graph.

Force-directed placement

One of the most common network layouts is force-directed placement.

D3 example of a force-directed graph

In one variant, network elements are positioned according to a simulation of physical forces where nodes push away from each other while links act like springs that draw the endpoint nodes closer to each other. Typically this method places nodes randomly within the spatial region and iteratively refines the locations by gradually shifting nodes around until the layout improves and stabilizes.

Under this method, absolute spatial position does not encode any meaningful value. The algorithm is designed to minimize distracting artifacts that might confuse the viewer (e.g. edge crossings, node overlaps), so spatial location is merely a side effect. While absolute position is meaningless, relative spatial location can be meaningful. Tightly interconnected groups of nodes should be drawn relatively close together, which could indicate a substantive clustering. However this could also be an artifact of the algorithm. Alternative measures such as centrality are more robust measures of relative node importance in a network.

Because these algorithms start with random placement, in order to exactly reproduce a network visualization you should remember to set your random number seed at the beginning of the script (set.seed() in R).

In this graph, edges are drawn between legislators who agree above the Congress’ threshold value of votes - this is defined as the average level of cooperation within each Congress. If the authors did not do this, every pair with at least one vote in agreement would have an edge drawn, making the graph even more complex.

We also see that the authors encode additional information in the visualization through extra channels:

  • Partisan affiliation is indicated by coloring the nodes red (Republican) or blue (Democrat)
  • Each node is sized relative to its total connection, so larger nodes have more overall connections with other legislators
  • Edges are thicker if the pair agrees on more votes

Hairball node-edge diagrams

Hairball diagrams are those with so many nodes and edges that the diagram becomes a jumbled mess and interpretation becomes extremely difficult. A general rule of thumb is that if the number of nodes is more than 4 times the number of edges, straight-forward force-directed placement will not be optimal.

In this example, the individual network diagrams are pretty bad if you want to interpret them at the legislator level. For instance, consider the 2011 graph:

2011 rise of partisanship

Trying to track all the nodes and edges to identify legislators with the most ties (cooperative pairings) is downright impossible. There is just too much going on here. However, the visualization is very good at depicting the increasing partisanship in the U.S. House of Representatives. Democrats and Republicans are clustered together on the graph (partially the algorithm and partially the fact that legislators vote most frequently with members of their own party). It is easy to see that in the 1950s and 60s there were a lot of edges connecting legislators from both sides of the aisle. In fact we can even see more mixing of the nodes, where Republicans and Democrats are drawn more closely together on the grid. Over time, we can see both the number of cross-party edges decreasing and the spatial distance between the core Democrat and Republican clusters increasing, both outcomes of increased partisanship and decreasing cooperation.

Love Actually

Example adapted from David Robinson’s Analyzing networks of characters in ‘Love Actually’

Import data

First we use R to parse the raw script into a tidy data frame.

raw <- readLines("data/love_actually.txt")

lines <- data_frame(raw = raw) %>%
    filter(raw != "", !str_detect(raw, "(song)")) %>%
    mutate(is_scene = str_detect(raw, " Scene "),
           scene = cumsum(is_scene)) %>%
    filter(!is_scene) %>%
    separate(raw, c("speaker", "dialogue"), sep = ":", fill = "left") %>%
    group_by(scene, line = cumsum(!is.na(speaker))) %>%
    summarize(speaker = speaker[1], dialogue = str_c(dialogue, collapse = " "))

Next we match characters to actors:

cast <- read_csv(url("http://varianceexplained.org/files/love_actually_cast.csv"))

lines <- lines %>%
    inner_join(cast) %>%
    mutate(character = paste0(speaker, " (", actor, ")"))
lines[460:465,] %>%
  knitr::kable(caption = "Example of tidied data")
Example of tidied data
scene line speaker dialogue actor character
52 581 Sarah Nobody’s trying to kill you, babe. Laura Linney Sarah (Laura Linney)
52 582 Sarah Thank you. Don’t do that, my darling. Thank you. Don’t do that. Laura Linney Sarah (Laura Linney)
53 583 Harry Right. Back at three. Christmas shopping, never an easy or a pleasant task. Alan Rickman Harry (Alan Rickman)
53 584 Mia Are you going to get me something? Heike Makatsch Mia (Heike Makatsch)
53 585 Harry Er… I don’t know, I hadn’t thought. Where’s Sarah, by the way? Alan Rickman Harry (Alan Rickman)
53 586 Mia She couldn’t make it in today. Family thing. Heike Makatsch Mia (Heike Makatsch)

In order to analyze the network structure, we need to count first the lines-per-scene-per-character, then convert this into a binary speaker-by-scene matrix:

by_speaker_scene <- lines %>%
  count(scene, character)
by_speaker_scene
## Source: local data frame [162 x 3]
## Groups: scene [?]
## 
## # A tibble: 162 x 3
##    scene                character     n
##    <int>                    <chr> <int>
##  1     2       Billy (Bill Nighy)     5
##  2     2      Joe (Gregor Fisher)     3
##  3     3      Jamie (Colin Firth)     5
##  4     4     Daniel (Liam Neeson)     3
##  5     4    Karen (Emma Thompson)     6
##  6     5    Colin (Kris Marshall)     4
##  7     6    Jack (Martin Freeman)     2
##  8     6       Judy (Joanna Page)     1
##  9     7    Mark (Andrew Lincoln)     4
## 10     7 Peter (Chiwetel Ejiofor)     4
## # ... with 152 more rows
library(reshape2)
speaker_scene_matrix <- by_speaker_scene %>%
  acast(character ~ scene, fun.aggregate = length)

dim(speaker_scene_matrix)
## [1] 20 76

Hierarchical clustering

Hierarchical clustering is a technique for identifying clusters or subgroups in a data set. Dendrograms are a common visualization of these clusters. Observations which fuse into branches lower on the graph are generally similar to one another, whereas observations which fuse higher are less similar. Similarity is defined generally by distance between observations; here, defined by how many scenes each pair of characters shares.

norm <- speaker_scene_matrix / rowSums(speaker_scene_matrix)

h <- hclust(dist(norm, method = "manhattan"))

ggdendro::ggdendrogram(h)

This looks about right! Almost all the romantic pairs are together (Natalia/PM; Aurelia/Jamie, Harry/Karen; Karl/Sarah; Juliet/Peter; Jack/Judy) as are the friends (Colin/Tony; Billy/Joe) and family (Daniel/Sam).

One thing this tree is perfect for is giving an ordering that puts similar characters close together:

ordering <- h$labels[h$order]
ordering
##  [1] "Natalie (Martine McCutcheon)" "PM (Hugh Grant)"             
##  [3] "Aurelia (Lúcia Moniz)"        "Jamie (Colin Firth)"         
##  [5] "Daniel (Liam Neeson)"         "Sam (Thomas Sangster)"       
##  [7] "Jack (Martin Freeman)"        "Judy (Joanna Page)"          
##  [9] "Colin (Kris Marshall)"        "Tony (Abdul Salis)"          
## [11] "Billy (Bill Nighy)"           "Joe (Gregor Fisher)"         
## [13] "Mark (Andrew Lincoln)"        "Juliet (Keira Knightley)"    
## [15] "Peter (Chiwetel Ejiofor)"     "Karl (Rodrigo Santoro)"      
## [17] "Sarah (Laura Linney)"         "Mia (Heike Makatsch)"        
## [19] "Harry (Alan Rickman)"         "Karen (Emma Thompson)"

This ordering can be used to make other graphs more informative. For instance, we can visualize a timeline of all scenes:

scenes <- by_speaker_scene %>%
    filter(n() > 1) %>%        # scenes with > 1 character
    ungroup() %>%
    mutate(scene = as.numeric(factor(scene)),
           character = factor(character, levels = ordering))

ggplot(scenes, aes(scene, character)) +
    geom_point() +
    geom_path(aes(group = scene))

Adjacency matrix

Alternatively, network data can be visualized using a matrix view based on a table from the original network data. For example, an adjacency matrix is a matrix where the nodes of the network are lined up on the rows and columns, and the edges are encoded in the cell values. To visualize the edges, we draw a heatmap with color used to encode the edges.

Here we draw an adjacency matrix for the Love Actually data. To generate the adjacency matrix, all we need is the speaker-by-scene matrix. Recall that this matrix records whether a character is present in every scene using a binary 0/1 coding scheme. To generate an adjacency matrix, we simply multiply the matrix by its transpose. This identifies the number of scenes in which every pair of characters appear.

non_airport_scenes <- speaker_scene_matrix[, colSums(speaker_scene_matrix) < 10]
non_airport_scenes[1:5, 1:5]
##                       2 3 4 5 6
## Aurelia (Lúcia Moniz) 0 0 0 0 0
## Billy (Bill Nighy)    1 0 0 0 0
## Colin (Kris Marshall) 0 0 0 1 0
## Daniel (Liam Neeson)  0 0 1 0 0
## Harry (Alan Rickman)  0 0 0 0 0
cooccur <- non_airport_scenes %*% t(non_airport_scenes)
cooccur[1:5, 1:5]
##                       Aurelia (Lúcia Moniz) Billy (Bill Nighy)
## Aurelia (Lúcia Moniz)                     5                  0
## Billy (Bill Nighy)                        0                  6
## Colin (Kris Marshall)                     0                  0
## Daniel (Liam Neeson)                      0                  0
## Harry (Alan Rickman)                      0                  0
##                       Colin (Kris Marshall) Daniel (Liam Neeson)
## Aurelia (Lúcia Moniz)                     0                    0
## Billy (Bill Nighy)                        0                    0
## Colin (Kris Marshall)                     6                    0
## Daniel (Liam Neeson)                      0                   11
## Harry (Alan Rickman)                      0                    0
##                       Harry (Alan Rickman)
## Aurelia (Lúcia Moniz)                    0
## Billy (Bill Nighy)                       0
## Colin (Kris Marshall)                    0
## Daniel (Liam Neeson)                     0
## Harry (Alan Rickman)                    10

To visualize the adjacency matrix, we draw a heatmap. Note that when using the default heatmap() function from the stats library, dendrograms are also added to the left and top sides of the visualization:

heatmap(cooccur)

If you want to use ggplot(), you need to first convert the adjacency matrix into a tidy data frame, then plot that object (and you lose the dendrograms):

cooccur %>%
  as_tibble %>%
  mutate(id1 = rownames(cooccur)) %>%
  gather(id2, n, -id1) %>%
  mutate_at(vars(id1, id2), funs(factor(., levels = ordering))) %>%
  ggplot(aes(id1, id2, fill = n)) +
  geom_tile() +
  scale_fill_continuous(low = "white", high = "red") +
  coord_fixed() +
  labs(x = NULL,
       y = NULL,
       fill = NULL) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Node-edge diagram

To draw the network as a node-edge diagram, first we need to convert the adjacency matrix into an igraph object.1 Rather than storing it in a pure matrix form, the igraph object is a specialized data object in R for storing network data. Many graphing functions in R for network data rely on this structure.2

g <- graph_from_adjacency_matrix(cooccur,
                                 weighted = TRUE,
                                 mode = "undirected",
                                 diag = FALSE)
g
## IGRAPH UNW- 20 37 -- 
## + attr: name (v/c), weight (e/n)
## + edges (vertex names):
##  [1] Aurelia (Lúcia Moniz)--Jamie (Colin Firth)   
##  [2] Billy (Bill Nighy)   --Joe (Gregor Fisher)   
##  [3] Colin (Kris Marshall)--Mark (Andrew Lincoln) 
##  [4] Colin (Kris Marshall)--Tony (Abdul Salis)    
##  [5] Daniel (Liam Neeson) --Karen (Emma Thompson) 
##  [6] Daniel (Liam Neeson) --Sam (Thomas Sangster) 
##  [7] Harry (Alan Rickman) --Jamie (Colin Firth)   
##  [8] Harry (Alan Rickman) --Karen (Emma Thompson) 
## + ... omitted several edges

The first argument is the adjacency matrix, and the remaining arguments are:

  • weighted = TRUE - creates a weighted graph. Here the weights refer to the fact that some edges are larger than others (i.e. some characters appear together in more scenes than others). In an unweighted graph, edges are binary - either they exist between two nodes, or they do not.
  • mode = "undirected" - there is no directionality to the edges.
  • diag = FALSE - set the diagonal of the matrix to be zero. We want to ignore the fact that characters appear in different numbers of scenes overall (this is basically ignoring nodes linked to themselves, i.e. the raw number of scenes in which the character appears).

To draw the node-edge diagram, we can use several packages. Here I use ggraph, an extension of ggplot2 which incorporates relational data structures into the layered grammar of graphics foundation.

ggraph(g) + 
  geom_edge_link(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none")

Technical details about the visualization:

  • Edges are drawn based on the weights. That is, thicker lines indicate more scenes together and a stronger “tie” between the characters
  • I used the default algorithm for drawing the diagram, which is the nicely() algorithm from igraph. We will examine the output of additional algorithms later.
  • Nodes are labeled with the character’s name. This tends to add more noise to the graph, but makes interpretation about individual character arcs and relationships easier.

A few patterns pop out of this visualization. We see that the majority of characters are tightly connected (often by the scenes at the school play, or by Karen (Emma Thompson), who is friends or family to many key characters). But we see Bill Nighy’s plotline occurs almost entirely separate from everyone else, and that five other characters are linked to the main network by only a single thread (Sarah’s conversation with Mark at the wedding).

ggraph for network visualizations

Let’s explore in more detail the different components for drawing graphs using ggraph. Again, this is not the only approach in R. In fact, the package was only published in the past few months and is therefore still building in support and usage. However it is pretty thoroughly documented and adheres to the grammar of graphics philosophy. Other packages for network visualization in R include:

  • igraph - the workhorse package in R. ggraph and other visualization tools heavily rely on this packages core functions for generating layouts, creating data structures, and other backend processing tasks. igraph also includes loads of functions for statistical analysis of networks (measures of density, reciprocity, centrality, distance, etc.)
  • ggnet2 - another package using the ggplot2 syntax
  • ggnetwork - yet another package using ggplot2, in fact from the same author of ggnet2. This package retains the same common functions in ggplot2 to add geom_()s to the visualization. It most resembles standard ggplot2 code.

The layout

The layout defines the vertical and horizontal placement of nodes when plotting a graph structure. The layout algorithm takes in a graph structure and returns the vertical and horizontal position of the nodes. Different algorithms (and in fact different iterations) return different spatial positions:

ggraph(g) + 
  geom_edge_link(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none") +
  ggtitle("Default layout (Nicely) algorithm")

ggraph(g, layout = "kk") + 
  geom_edge_link(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none") +
  ggtitle("Kamada and Kawai spring-based algorithm")

ggraph(g, layout = "fr") + 
  geom_edge_link(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none") +
  labs(title = "Fruchterman-Reingold algorithm",
       subtitle = "Force-directed layout, 500 iterations")

ggraph(g, layout = "fr", niter = 100) + 
  geom_edge_link(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none") +
  labs(title = "Fruchterman-Reingold algorithm",
       subtitle = "Force-directed layout, 100 iterations")

ggraph(g, layout = "fr", niter = 10) + 
  geom_edge_link(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none") +
  labs(title = "Fruchterman-Reingold algorithm",
       subtitle = "Force-directed layout, 10 iterations")

ggraph(g, layout = "grid") + 
  geom_edge_link(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none") +
  ggtitle("Grid algorithm")

ggraph(g, layout = "star") + 
  geom_edge_link(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none") +
  ggtitle("Star algorithm")

ggraph(g, layout = "linear") + 
  geom_edge_arc(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none") +
  ggtitle("Linear algorithm")

ggraph(g, layout = "linear", circular = TRUE) + 
  geom_edge_arc(aes(edge_width = weight)) + 
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  scale_edge_width_continuous(range = c(.5, 3)) +
  theme_graph() +
  theme(legend.position = "none") +
  ggtitle("Star algorithm (circular)")

Possible layouts in igraph

Dendrograms

dendrogram <- as.dendrogram(hclust(dist(iris[, 1:4])))

ggraph(dendrogram, 'dendrogram') + 
  geom_edge_elbow()

ggraph(dendrogram, 'dendrogram', circular = TRUE) + 
  geom_edge_elbow() + 
  coord_fixed()

Nodes

Remember that nodes are the entities that are connected in a network. Nodes represent entities, and the layout is their physical placement. Once we have their spatial placement from the layout, drawing them is basically like drawing a scatterplot:

ggraph(g, layout = "kk") +
  geom_point(aes(x = x, y = y))

ggraph(g, layout = "kk") + 
  geom_edge_link() + 
  geom_node_point()

This works because the result of ggraph() and create_layout() is a data frame:

create_layout(g, layout = "kk")
##          x       y                         name ggraph.orig_index circular
## 1  -2.1417  1.3424        Aurelia (Lúcia Moniz)                 1    FALSE
## 2  -4.1922 -1.9677           Billy (Bill Nighy)                 2    FALSE
## 3   0.2076 -1.0916        Colin (Kris Marshall)                 3    FALSE
## 4   0.7592 -0.0822         Daniel (Liam Neeson)                 4    FALSE
## 5  -0.6400  0.1823         Harry (Alan Rickman)                 5    FALSE
## 6  -0.0503  0.6976        Jack (Martin Freeman)                 6    FALSE
## 7  -0.5979 -0.1755          Jamie (Colin Firth)                 7    FALSE
## 8  -4.4812 -1.0155          Joe (Gregor Fisher)                 8    FALSE
## 9  -0.2101  0.6793           Judy (Joanna Page)                 9    FALSE
## 10 -1.1259 -2.0090     Juliet (Keira Knightley)                10    FALSE
## 11 -0.1101 -0.1096        Karen (Emma Thompson)                11    FALSE
## 12 -0.4277 -0.0871       Karl (Rodrigo Santoro)                12    FALSE
## 13 -0.0347 -0.8263        Mark (Andrew Lincoln)                13    FALSE
## 14 -0.2749  0.1918         Mia (Heike Makatsch)                14    FALSE
## 15  0.0147  0.2879 Natalie (Martine McCutcheon)                15    FALSE
## 16 -0.2896 -2.3068     Peter (Chiwetel Ejiofor)                16    FALSE
## 17 -0.3799  0.4961              PM (Hugh Grant)                17    FALSE
## 18  0.0907  0.5504        Sam (Thomas Sangster)                18    FALSE
## 19 -0.0635 -0.4586         Sarah (Laura Linney)                19    FALSE
## 20  1.4496 -1.9517           Tony (Abdul Salis)                20    FALSE
##    ggraph.index
## 1             1
## 2             2
## 3             3
## 4             4
## 5             5
## 6             6
## 7             7
## 8             8
## 9             9
## 10           10
## 11           11
## 12           12
## 13           13
## 14           14
## 15           15
## 16           16
## 17           17
## 18           18
## 19           19
## 20           20

The geom_node_*() family of functions are essentially direct translations from geom_*(), but retain the network specific flavor. In addition, they explicitly assume the \(x\) and \(y\) coordinates are stored in x and y, so they do not need to be declared in the function call.

The three core functions are geom_node_point(), geom_node_text(), and geom_node_label():

ggraph(g, layout = "kk") + 
  geom_edge_link() + 
  geom_node_point()

ggraph(g, layout = "kk") + 
  geom_edge_link() + 
  geom_node_text(aes(label = name))

ggraph(g, layout = "kk") + 
  geom_edge_link() + 
  geom_node_label(aes(label = name))

ggraph(g, layout = "kk") + 
  geom_edge_link() + 
  geom_node_label(aes(label = name), repel = TRUE)

Communicating attributes of nodes

# example files obtained from http://kateto.net/network-visualization

nodes <- read_csv("data/Dataset1-Media-Example-NODES.csv")
links <- read_csv("data/Dataset1-Media-Example-EDGES.csv")

links <- aggregate(links[,3], links[,-3], sum)
links <- links[order(links$from, links$to),]

net <- graph_from_data_frame(d = links, vertices = nodes, directed = TRUE)
net
## IGRAPH DNW- 17 49 -- 
## + attr: name (v/c), media (v/c), media.type (v/n), type.label
## | (v/c), audience.size (v/n), type (e/c), weight (e/n)
## + edges (vertex names):
##  [1] s01->s02 s01->s03 s01->s04 s01->s15 s02->s01 s02->s03 s02->s09
##  [8] s02->s10 s03->s01 s03->s04 s03->s05 s03->s08 s03->s10 s03->s11
## [15] s03->s12 s04->s03 s04->s06 s04->s11 s04->s12 s04->s17 s05->s01
## [22] s05->s02 s05->s09 s05->s15 s06->s06 s06->s16 s06->s17 s07->s03
## [29] s07->s08 s07->s10 s07->s14 s08->s03 s08->s07 s08->s09 s09->s10
## [36] s10->s03 s12->s06 s12->s13 s12->s14 s13->s12 s13->s17 s14->s11
## [43] s14->s13 s15->s01 s15->s04 s15->s06 s16->s06 s16->s17 s17->s04

We can use standard aesthetics of ggplot2 to communicate attributes of nodes. For instance, consider this basic node-edge diagram about media companies:

ggraph(net) +
  geom_edge_link() +
  geom_node_point() +
  geom_node_label(aes(label = media), repel = TRUE) +
  theme_graph()

We can add more information about the nodes via aes(). For instance, we know what type of media company each one is (newspaper, TV, or online). We could add this using the color aesthetic:

ggraph(net) +
  geom_edge_link() +
  geom_node_point(aes(color = type.label), size = 2) +
  geom_node_label(aes(label = media), repel = TRUE, size = 2) +
  theme_graph()

We also know the approximate audience size of each media company. We could add this information via the size aesthetic:

ggraph(net) +
  geom_edge_link() +
  geom_node_point(aes(color = type.label, size = audience.size)) +
  geom_node_label(aes(label = media), repel = TRUE, size = 2) +
  theme_graph()

Edges

Edges simply denote the relationship between two entities or nodes. A straight line is one possible method for visualizing this relationship, but it is not the only way. geom_edge_link() will however provide straight lines for edges:

ggraph(net) +
  geom_edge_link() +
  geom_node_point(aes(color = type.label, size = audience.size)) +
  geom_node_label(aes(label = media), repel = TRUE, size = 2) +
  theme_graph()

However you can use other approaches, such as a fan:

ggraph(net) +
  geom_edge_fan() +
  geom_node_point(aes(color = type.label, size = audience.size)) +
  geom_node_label(aes(label = media), repel = TRUE, size = 2) +
  theme_graph()

When edges run in parallel (i.e. multiple edges between the same two nodes), then ggraph spreads them out as arcs with different curvatures. Because the media example is a directed graph, edges can run in both directions which is why some of the nodes have multiple edges.

Decorating edges

Edges can also have attributes mapped to them. We’ve already seen this by assigning edge thickness based on a given weighting attribute. We can also encode directionality, as in the media example, with the use of arrows:

ggraph(net) +
  geom_edge_fan(arrow = arrow(length = unit(4, 'mm'))) +
  geom_node_point(aes(color = type.label, size = audience.size)) +
  geom_node_label(aes(label = media), repel = TRUE, size = 2) +
  theme_graph()

Unfortunately by default the arrows extend through the center of the node. To correct this use start_cap and end_cap aesthetics to change the start and stop of the edge:

ggraph(net) +
  geom_edge_fan(arrow = arrow(length = unit(4, 'mm')), 
                   end_cap = circle(3, 'mm')) +
  geom_node_point(aes(color = type.label, size = audience.size)) +
  geom_node_label(aes(label = media), repel = TRUE, size = 2) +
  theme_graph()

And like with nodes, we can also color-code the edges. For instance, here we can color the edge based on if the relationship is from a hyperlink or a mention:

ggraph(net) +
  geom_edge_fan(arrow = arrow(length = unit(4, 'mm')), 
                   end_cap = circle(3, 'mm'),
                aes(color = type)) +
  geom_node_point(aes(color = type.label, size = audience.size)) +
  geom_node_label(aes(label = media), repel = TRUE, size = 2) +
  theme_graph()

To change the color palettes, use scales_color_*() like normal:

ggraph(net) +
  geom_edge_fan(arrow = arrow(length = unit(4, 'mm')), 
                   end_cap = circle(3, 'mm'),
                aes(color = type)) +
  geom_node_point(aes(color = type.label, size = audience.size)) +
  geom_node_label(aes(label = media), repel = TRUE, size = 2) +
  theme_graph() +
  scale_color_brewer(palette = "Paired") +
  scale_edge_color_brewer(palette = "Dark2")

Session Info

devtools::session_info()
##  setting  value                       
##  version  R version 3.3.3 (2017-03-06)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2017-05-15                  
## 
##  package     * version    date       source                            
##  assertthat    0.2.0      2017-04-11 cran (@0.2.0)                     
##  backports     1.0.5      2017-01-18 CRAN (R 3.3.2)                    
##  base        * 3.3.3      2017-03-07 local                             
##  broom       * 0.4.2      2017-02-13 CRAN (R 3.3.2)                    
##  cellranger    1.1.0      2016-07-27 CRAN (R 3.3.0)                    
##  colorspace    1.3-2      2016-12-14 CRAN (R 3.3.2)                    
##  datasets    * 3.3.3      2017-03-07 local                             
##  DBI           0.6-1      2017-04-01 CRAN (R 3.3.2)                    
##  devtools      1.13.0     2017-05-08 CRAN (R 3.3.2)                    
##  digest        0.6.12     2017-01-27 CRAN (R 3.3.2)                    
##  dplyr       * 0.5.0      2016-06-24 CRAN (R 3.3.0)                    
##  evaluate      0.10       2016-10-11 CRAN (R 3.3.0)                    
##  forcats     * 0.2.0      2017-01-23 CRAN (R 3.3.2)                    
##  foreign       0.8-68     2017-04-24 CRAN (R 3.3.2)                    
##  ggforce       0.1.1      2016-11-28 CRAN (R 3.3.2)                    
##  ggplot2     * 2.2.1.9000 2017-05-12 Github (tidyverse/ggplot2@f4398b6)
##  ggraph      * 1.0.0      2017-02-24 CRAN (R 3.3.2)                    
##  ggrepel       0.6.5      2016-11-24 CRAN (R 3.3.2)                    
##  graphics    * 3.3.3      2017-03-07 local                             
##  grDevices   * 3.3.3      2017-03-07 local                             
##  grid          3.3.3      2017-03-07 local                             
##  gridExtra     2.2.1      2016-02-29 cran (@2.2.1)                     
##  gtable        0.2.0      2016-02-26 CRAN (R 3.3.0)                    
##  haven         1.0.0      2016-09-23 cran (@1.0.0)                     
##  hms           0.3        2016-11-22 CRAN (R 3.3.2)                    
##  htmltools     0.3.6      2017-04-28 cran (@0.3.6)                     
##  httr          1.2.1      2016-07-03 CRAN (R 3.3.0)                    
##  igraph      * 1.0.1      2015-06-26 CRAN (R 3.3.0)                    
##  jsonlite      1.4        2017-04-08 cran (@1.4)                       
##  knitr       * 1.15.1     2016-11-22 cran (@1.15.1)                    
##  lattice       0.20-35    2017-03-25 CRAN (R 3.3.2)                    
##  lazyeval      0.2.0      2016-06-12 CRAN (R 3.3.0)                    
##  lubridate     1.6.0      2016-09-13 CRAN (R 3.3.0)                    
##  magrittr      1.5        2014-11-22 CRAN (R 3.3.0)                    
##  MASS          7.3-47     2017-04-21 CRAN (R 3.3.2)                    
##  memoise       1.1.0      2017-04-21 CRAN (R 3.3.2)                    
##  methods     * 3.3.3      2017-03-07 local                             
##  mnormt        1.5-5      2016-10-15 CRAN (R 3.3.0)                    
##  modelr      * 0.1.0      2016-08-31 CRAN (R 3.3.0)                    
##  munsell       0.4.3      2016-02-13 CRAN (R 3.3.0)                    
##  nlme          3.1-131    2017-02-06 CRAN (R 3.3.3)                    
##  parallel      3.3.3      2017-03-07 local                             
##  plyr          1.8.4      2016-06-08 CRAN (R 3.3.0)                    
##  psych         1.7.5      2017-05-03 CRAN (R 3.3.3)                    
##  purrr       * 0.2.2.2    2017-05-11 CRAN (R 3.3.3)                    
##  R6            2.2.1      2017-05-10 CRAN (R 3.3.2)                    
##  Rcpp          0.12.10    2017-03-19 cran (@0.12.10)                   
##  readr       * 1.1.0      2017-03-22 cran (@1.1.0)                     
##  readxl        1.0.0      2017-04-18 CRAN (R 3.3.2)                    
##  reshape2      1.4.2      2016-10-22 CRAN (R 3.3.0)                    
##  rlang         0.1.9000   2017-05-12 Github (hadley/rlang@c17568e)     
##  rmarkdown     1.5        2017-04-26 CRAN (R 3.3.2)                    
##  rprojroot     1.2        2017-01-16 CRAN (R 3.3.2)                    
##  rvest         0.3.2      2016-06-17 CRAN (R 3.3.0)                    
##  scales        0.4.1      2016-11-09 CRAN (R 3.3.1)                    
##  stats       * 3.3.3      2017-03-07 local                             
##  stringi       1.1.5      2017-04-07 CRAN (R 3.3.2)                    
##  stringr     * 1.2.0      2017-02-18 CRAN (R 3.3.2)                    
##  tibble      * 1.3.0.9002 2017-05-12 Github (tidyverse/tibble@9103a30) 
##  tidyr       * 0.6.2      2017-05-04 CRAN (R 3.3.2)                    
##  tidyverse   * 1.1.1      2017-01-27 CRAN (R 3.3.2)                    
##  tools         3.3.3      2017-03-07 local                             
##  tweenr        0.1.5      2016-10-10 CRAN (R 3.3.0)                    
##  udunits2      0.13       2016-11-17 CRAN (R 3.3.2)                    
##  units         0.4-4      2017-04-20 CRAN (R 3.3.2)                    
##  utils       * 3.3.3      2017-03-07 local                             
##  viridis       0.4.0      2017-03-27 CRAN (R 3.3.2)                    
##  viridisLite   0.2.0      2017-03-24 cran (@0.2.0)                     
##  withr         1.0.2      2016-06-20 CRAN (R 3.3.0)                    
##  xml2          1.1.1      2017-01-24 CRAN (R 3.3.2)                    
##  yaml          2.1.14     2016-11-12 cran (@2.1.14)

  1. igraph is one of the most important packages in R for graph and network analysis.

  2. We could also generate the igraph from the original data frame using graph_from_data_frame().